Abstract: The main objective of EC web is to crawl relevant forum content from the web with minimal overhead. Forum threads contain the information content that forum crawlers target. Although forums have different layouts or styles and are powered by different forum software packages, they always have similar implicit navigation paths, connected by specific URL types, that lead users from entry pages to thread pages. We therefore reduce the web forum crawling problem to a URL-type recognition problem. Training sets are created, and accurate and effective regular expression patterns are learned from them. We apply this knowledge to unseen URLs to identify the type of each URL. After classification, all crawled URLs are stored in a log. The URL log is used to identify strong and weak URLs by eliminating duplicate URLs from the log. Finally, the effectiveness of the strong URLs is measured.

Keywords: Effective Crawling, Web Forum, URL Type Recognition Module, Crawling Module.
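
As an illustration of the two steps named in the abstract (URL-type recognition with regular expressions, then duplicate elimination in the URL log), a minimal Python sketch follows. The patterns, the type names "thread", "index" and "other", and the functions classify_url, normalize and deduplicate are assumptions made for this example; they are not the patterns learned by EC web. The abstract does not state the criterion that separates strong from weak URLs, so the sketch covers only duplicate removal.

```python
import re
from urllib.parse import urlsplit, urlunsplit

# Hypothetical hand-written patterns for two common forum URL styles.
# In the approach described above, such patterns would be learned
# from training sets rather than written by hand.
URL_TYPE_PATTERNS = [
    ("thread", re.compile(r"/(?:showthread|viewtopic)\.php\?(?:t|tid)=\d+")),
    ("index", re.compile(r"/(?:forumdisplay|viewforum)\.php\?(?:f|fid)=\d+")),
]


def classify_url(url):
    """Return the first URL type whose pattern matches, else 'other'."""
    for url_type, pattern in URL_TYPE_PATTERNS:
        if pattern.search(url):
            return url_type
    return "other"


def normalize(url):
    """Lower-case the scheme and host and drop the fragment, so that
    trivially different spellings of one URL compare equal."""
    parts = urlsplit(url)
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), parts.path, parts.query, "")
    )


def deduplicate(url_log):
    """Keep the first occurrence of each normalized URL, in log order."""
    seen = set()
    unique = []
    for url in url_log:
        key = normalize(url)
        if key not in seen:
            seen.add(key)
            unique.append(url)
    return unique
```

Under these assumptions, classify_url assigns each unseen URL a type before it is written to the log, and deduplicate is then run over the log so that each page is counted once.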